resmgr: lifecycle overlap detection and workaround. #358
Use PrettyName()-compatible container name dumps together with container IDs in NRI event dumps. Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
Having loggers embedded in resmgr types as members was a bad idea. Replace them with module-global logger instances. Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
We recently discovered a problem with the stream of container lifecycle events generated by some runtime versions. As a side effect, we get Create/Stop events for multiple container instances with seemingly overlapping lifecycles: the later instance gets created before the earlier one is stopped. If undetected, such a false overlap might cause resources to be overcommitted, with both instances temporarily using the full resource set of the container. As a workaround, we now also track containers by fully qualified name ($namespace/pod/ctr) and internally generate an event to release the resources of the old instance whenever we notice that a creation event would cause a duplicate instance for the same name. Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
LGTM
With the comments already on checkAllocations()
Check grants, looking for grants with stale allocations or duplicate containers (detected using fully qualified names). Dump total memory and CPU granted. Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
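For illustration only, the kind of check this commit message describes could look roughly like the sketch below. The `Grant` and `Report` types, the `CheckGrants` function, and the `live` set are all hypothetical names invented for this example; they are not taken from the resmgr code and make no claim about how checkAllocations() is actually implemented.

```go
package main

import "fmt"

// Grant is a hypothetical, pared-down record of resources granted to one
// container instance. It is not the resmgr type.
type Grant struct {
	ContainerID string
	Name        string // fully qualified name: namespace/pod/container
	MilliCPU    int
	MemBytes    int64
}

// Report collects what the check found.
type Report struct {
	Stale         []string // container IDs with a grant but no live container
	Duplicates    []string // qualified names granted to more than one instance
	TotalMilliCPU int
	TotalMemBytes int64
}

// CheckGrants walks all grants, flags those whose container is no longer
// live, flags qualified names that appear under more than one container ID,
// and sums up the granted CPU and memory.
func CheckGrants(grants []Grant, live map[string]bool) Report {
	var r Report
	seen := map[string]string{} // qualified name -> first container ID seen
	for _, g := range grants {
		r.TotalMilliCPU += g.MilliCPU
		r.TotalMemBytes += g.MemBytes
		if !live[g.ContainerID] {
			r.Stale = append(r.Stale, g.ContainerID)
		}
		if first, ok := seen[g.Name]; !ok {
			seen[g.Name] = g.ContainerID
		} else if first != g.ContainerID {
			r.Duplicates = append(r.Duplicates, g.Name)
		}
	}
	return r
}

func main() {
	grants := []Grant{
		{ContainerID: "id-1", Name: "default/pod0/ctr0", MilliCPU: 500, MemBytes: 1 << 20},
		{ContainerID: "id-2", Name: "default/pod0/ctr0", MilliCPU: 500, MemBytes: 1 << 20},
	}
	// Assume only id-2 is still known to be alive.
	r := CheckGrants(grants, map[string]bool{"id-2": true})
	fmt.Println("stale:", r.Stale)
	fmt.Println("duplicates:", r.Duplicates)
	fmt.Println("total mCPU:", r.TotalMilliCPU, "total bytes:", r.TotalMemBytes)
}
```

In this sketch a stale grant and a duplicate name are reported independently, so a leftover grant from an overlapping instance shows up in both lists.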
LGTM
We recently discovered a problem with the stream of container lifecycle events generated by some runtime versions. As a side effect, we get Create/Stop events for multiple container instances with seemingly overlapping lifecycles: the later instance gets created before the earlier one is stopped.
If undetected, such a false overlap might cause resources to be overcommitted, with both instances temporarily using the full resource set of the offending container. As a workaround, we now also track containers by fully qualified name ($namespace/pod/ctr) and internally generate an event to release the resources of the old instance whenever we notice that a creation event would cause a duplicate instance for the same name.
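A minimal sketch of the workaround idea, assuming a simplified model: containers are indexed both by ID and by fully qualified name, and a creation event that collides on the name triggers an internally generated release of the old instance. The `Container` and `Tracker` types and the `release` callback are hypothetical and exist only for this example; the actual change in this PR may be structured quite differently.

```go
package main

import "fmt"

// Container is a hypothetical, pared-down stand-in for a tracked container.
type Container struct {
	ID        string
	Namespace string
	Pod       string
	Name      string
}

// QualifiedName returns the namespace/pod/container name of the container.
func (c *Container) QualifiedName() string {
	return c.Namespace + "/" + c.Pod + "/" + c.Name
}

// Tracker indexes live containers by ID and by fully qualified name.
type Tracker struct {
	byID    map[string]*Container
	byName  map[string]*Container
	release func(*Container) // frees the resources held by a container
}

// NewTracker creates an empty tracker using the given release callback.
func NewTracker(release func(*Container)) *Tracker {
	return &Tracker{
		byID:    map[string]*Container{},
		byName:  map[string]*Container{},
		release: release,
	}
}

// Create registers a new instance. If another instance already holds the
// same qualified name, its resources are released first, as if a Stop
// event had arrived for it.
func (t *Tracker) Create(c *Container) {
	name := c.QualifiedName()
	if old, ok := t.byName[name]; ok && old.ID != c.ID {
		t.release(old)
		delete(t.byID, old.ID)
	}
	t.byID[c.ID] = c
	t.byName[name] = c
}

// Stop handles a real Stop event. A late Stop for an instance that was
// already released by Create is ignored.
func (t *Tracker) Stop(id string) {
	c, ok := t.byID[id]
	if !ok {
		return
	}
	t.release(c)
	delete(t.byID, id)
	if t.byName[c.QualifiedName()] == c {
		delete(t.byName, c.QualifiedName())
	}
}

func main() {
	var released []string
	t := NewTracker(func(c *Container) { released = append(released, c.ID) })
	t.Create(&Container{ID: "id-1", Namespace: "default", Pod: "pod0", Name: "ctr0"})
	t.Create(&Container{ID: "id-2", Namespace: "default", Pod: "pod0", Name: "ctr0"}) // overlapping instance
	t.Stop("id-1")                                                                   // late Stop, already handled
	fmt.Println(released)                                                            // prints [id-1]
}
```

Releasing on the colliding Create rather than waiting for the delayed Stop keeps the two instances from holding the same resource set at once, and the late Stop then becomes a no-op.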